VAST challenge 2021 MC2
In the roughly twenty years that Tethys-based GAStech has been operating a natural gas production site in the island country of Kronos, it has produced remarkable profits and developed strong relationships with the government of Kronos. However, GAStech has not been as successful in demonstrating environmental stewardship.
In January, 2014, the leaders of GAStech are celebrating their new-found fortune as a result of the initial public offering of their very successful company. In the midst of this celebration, several employees of GAStech go missing. An organization known as the Protectors of Kronos (POK) is suspected in the disappearance, but things may not be what they seem.
As an expert in visual analytics, you are called in to help law enforcement from Kronos and Tethys
Mini-Challenge 2 asks you to analyze movement and tracking data. GAStech provides many of their employees with company cars for their personal and professional use, but unbeknownst to the employees, the cars are equipped with GPS tracking devices. You are given tracking data for the two weeks leading up to the disappearance, as well as credit card transactions and loyalty card usage data. From this data, can you identify anomalies and suspicious behaviors? Can you identify which people use which credit and loyalty cards?
The questions themselves offer an actionable approach,which guide us to solve the whole task incrementally.
First question require us to detect the anomalies with respect to the location and time based on the cc_data and loyalty_data. This step can also help us to pinpoint the pair of credit card and loyalty card, which are owned by the same employee.SO this underpins the third question.
Second question hints us to screen out the corresponding Car ID , the owner of who have the abnormal consumption pattern(time and location anomalies) derived from the first question. Further, we can induce the specific employee name baes on the cars-assignment data.
The third question is on top of answers of the first and second question.
Likewise, the answer of fourth question is also contained with the previous three questions.
Packages to be loaded:
packages = c('raster', 'sf',
'tmap', 'clock', 'lubridate',
'tidyverse','plotly','DT','patchwork','htmltools','mapview')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
cc_data
cc_data <- read_csv("MC2/cc_data.csv")
glimpse(cc_data)
Rows: 1,490
Columns: 4
$ timestamp <chr> "01/06/2014 07:28", "01/06/2014 07:34", "01/06/20~
$ location <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <dbl> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
cc_data$timestamp <- date_time_parse(cc_data$timestamp,
zone = "",
format = "%m/%d/%Y %H:%M")
cc_data$location <- as_factor(cc_data$location)
cc_data$last4ccnum <- as_factor(cc_data$last4ccnum)
cc_data_trans <- cc_data%>%
mutate(day = as_factor(get_day(timestamp)),
hour = as_factor(get_hour(timestamp)),
time = format(timestamp, format = "%H:%M"))
loyalty data
The data preparation of cc_data can also be applied to the loyalty_data as below:
loyalty_data <- read_csv("MC2/loyalty_data.csv")
loyalty_data$timestamp <- date_time_parse(loyalty_data$timestamp,
zone = "",
format = "%m/%d/%Y")
loyalty_data$location <- as_factor(loyalty_data$location)
loyalty_data$loyaltynum <- as_factor(loyalty_data$loyaltynum)
loyalty_data_trans <- loyalty_data%>%
mutate(day = get_day(timestamp))
One more particular data anomaly also need to be taken care of for both dataframes—‘Katerina<U+0092>s Caf
Now let’s put the two dataframes together to make a comparison:
glimpse(cc_data_trans)
Rows: 1,490
Columns: 7
$ timestamp <dttm> 2014-01-06 07:28:00, 2014-01-06 07:34:00, 2014-0~
$ location <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <fct> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
$ day <fct> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6~
$ hour <fct> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7~
$ time <chr> "07:28", "07:34", "07:35", "07:36", "07:37", "07:~
glimpse(loyalty_data_trans)
Rows: 1,392
Columns: 5
$ timestamp <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ location <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <fct> L2247, L9406, L8328, L6417, L1107, L4034, L6110, ~
$ day <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6~
Observations:
1. Ideally, one credit card record should correspond to one loyalty card record with exact date(not time),location and price.If this is the case,we can derive the pair of credit card and loyalty card owned by the same employee,by (outer) joining these two tables on the very three fields. 2. However, loyalty_data_trans has less rows than cc_data_trans.The reason maybe that not both credit card and loyalty card are supported for all of business locations or card holders didn’t make consumption by using both credit card and loyatly card.
Rows: 685,169
Columns: 4
$ Timestamp <chr> "01/06/2014 06:28:01", "01/06/2014 06:28:01", "01/~
$ id <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~
#Tidy and transform the gps_data.csv
gps$Timestamp <- date_time_parse(gps$Timestamp,
zone = "",
format = "%m/%d/%Y %H:%M:%S")
gps$id <- as_factor(gps$id)
gps<- gps %>%
mutate(weekday=wday(Timestamp),
monthday = as_factor(get_day(Timestamp)),
hour = get_hour(Timestamp))
#set as sf object
gps_sf <- st_as_sf(gps,
coords = c("long", "lat"),
crs = 4326)
#Transform from point object into linestring (path)
gps_path <- gps_sf %>%
group_by(id,monthday,weekday,hour) %>%
summarize(m =mean(Timestamp),
do_union=FALSE) %>%
st_cast("LINESTRING")
bgmap <- raster("MC2/MC2-tourist_modified.tif")
ggplot(cc_data_trans,aes(y=fct_rev(fct_infreq(location))))+
geom_bar()+
labs(y = "Business locations")+
ggtitle('Location popularity based on the credit card consupmtion records')
ggplot(loyalty_data_trans, aes(y=fct_rev(fct_infreq(location))))+
geom_bar()+
labs(y = "Business locations")+
ggtitle('Location popularity based on the loyalty card consupmtion records')
Observation 1:
From the bar chart above,on the one hand we can roughly summarize those most 6 popular locations are Katerinas Cafe,Hippokampos,Guy’s Gyros, Brew’ve Been Served, Hallowed Grounds,Quzeri Elian, Which make great sense since they are basiclly the cateens and all near the office of GAS tech building.On the other hand, those less popular places also need to be taken note of such as Kronos Pipe and Irrigation, Abila Scrapyard,Maximum Iron and Steel etc, which are quite abnormal relative to one employee’s daily life.
One particular anomaly is the place ‘Daily Dealz’ which i can’t find corresponding location in the Abila map.Also there is no such place in the loyalty data. So probably this place is lying outside of Abila.
Compared the above bar chart with that based on credit card consumption, we may find the location frequency shows almost the same pattern,which also make great sense.To find more abnormal patterns, we will make use of heatmap in the next.
#create a consumption_table for the purpose of ordering the last4ccnum by their consumption amount
cc_consumption_table <- cc_data_trans%>%
group_by(last4ccnum)%>%
summarise(consumption=sum(price))
level<- cc_consumption_table$last4ccnum[
order(cc_consumption_table$consumption, decreasing = T)]
cc_data_trans$last4ccnum <- factor(cc_data_trans$last4ccnum,
levels = level)
cc_data_trans$location=factor(fct_infreq(cc_data_trans$location))
#creat heatmap
cc_hm<- plot_ly(data=cc_data_trans,
x=~last4ccnum,
y=~location,
color=~price)%>%
layout(title = 'Heatmap of credit card with varying card holders')
cc_hm
Reference datatable for cc_data
dt<- datatable(cc_data_trans,filter='top')%>%
formatStyle(0,target = 'row')
dt
Now we change the x-axis of above heatmap from ‘last4ccnum’ to ‘hour’:
hour_hm<- plot_ly(data=cc_data_trans,
x= ~hour,y=~location,color=~price )%>%
layout(title = 'Heatmap of credit card with varying hours')
hour_hm
Next we plot the heatmap of loyalty card
* loyalty_data
#create a consumption_table for the purpose of ordering the loyaltynum by their consumption amount
loyalty_consumption_table<- loyalty_data_trans%>%
group_by(loyaltynum)%>%
summarise(consumption=sum(price))
level<- loyalty_consumption_table$loyaltynum[order(loyalty_consumption_table$consumption,decreasing = T)]
loyalty_data_trans$loyaltynum <- factor(loyalty_data_trans$loyaltynum, levels = level)
loyalty_data_trans$location=factor(fct_infreq(loyalty_data_trans$location))
# create heatmap
loyalty_hm<- plot_ly(data=loyalty_data_trans,
x= ~loyaltynum,y=~location,color=~price)%>%
layout(title = 'Heatmap of loyalty card with varying card holders')
loyalty_hm
Reference datatable for loyalty_data
dt<- datatable(loyalty_data_trans,filter='top')%>%
formatStyle(0,target = 'row')
dt
Note: For above 3 heatmaps, the ‘location’ on the Y-axis is in descending order based on their frequency while the card num(credit or loyalty) on the X-axis is in descending order based on the sum of price for each card num within each data.
Observation2:
Above two heatmaps also provide us with an alternative to derive the pair of credit and loyalty card, though there may be uncertainties. For example, since the record taking place in Abila Scrapyard corresponds to only one loyalty and credit card num, we can be very confident that L3317(loyalty) and 2276(credit)has the same owner.
Below are the detected anomalies:
1. Location anomaly The consumption of card holders(include the credit card and loyalty card) lying at left of X-axis happen and almost solely happen at those abnormal locations. The place includes National Refinery,Stewart and Sons Fabrication,Carlyle Chemical Inc etc, in sharp contrast to the Katerinas Cafe, Hippokampos,Guy’s Gyros or some other canteens where most of consumption recorded. So this is very suspicious. 2. Price amount anomaly The records happened at those abnormal places(mentioned above) also contain much larger consumption amount as opposed to other places.
3. Time anomaly
From the above heatmap with varying hour, we can easily find below two unusual phenomenons: a. 5 consumption records took place at kronos Mart at 3-4AM, which is Early morning b. 5 consumption records took place at Frydos Autosupply n’ More at 9-10 PM, which is abnormal relative the identity of place(seems like one car accessories shop )
The code is as below:
#gps_path_selected <- gps_path2 %>%
# filter(id==10)
tmap_mode("view")
tm_shape(bgmap)+
tm_rgb(bgmap, r=1, g=2, b=3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_path2) +
tm_lines()+
tm_lines('hour',
style='fixed',breaks=seq(0,24,by=4),
palette = "Blues",
popup.vars=c('hour','monthday','weekday'))+
tm_facets(by = "id", nclo=4,free.coords = FALSE)
Given the large knit load, i didn’t output above code results
We can infer the pair of credit card and loyalty car based on the cc_data and loyalty_data and also the corresponding car_id after joining the gps data.